0. Introduction

The objective of this analysis is to reduce the number and duration of absences among the company's employees. To achieve that, we will work with a dataset created by the HR department containing all the recent absences together with their duration and a set of features that they consider important. They have collected 593 observations over a period of xx time.

The steps followed are:

- Identify the most important features and their relation/impact on long absences. During this stage, we clean the dataset, draw some plots to better understand the data and get insights, and finally perform some feature engineering and feature selection.
- Provide a Machine Learning model to identify beforehand absences that might be prolonged beyond 5 hours. This is a classification problem and we will optimize our model for Accuracy. Considering the application of the model, we decided this is a better evaluation metric than Precision or Recall.
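The metric choice above can be illustrated with a quick sketch in base R (toy labels, not model output), computing accuracy, precision and recall from the same confusion-matrix counts:

```r
# Hypothetical labels: 1 = absence longer than 5 hours
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0)

tp <- sum(predicted == 1 & actual == 1)  # true positives
tn <- sum(predicted == 0 & actual == 0)  # true negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives

accuracy  <- (tp + tn) / length(actual)  # share of correct predictions
precision <- tp / (tp + fp)              # share of predicted positives that are real
recall    <- tp / (tp + fn)              # share of real positives we caught
```

Accuracy rewards being right on both classes, which fits a use case where short and long absences both matter.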

This is an iterative process: each time we do some feature engineering, we run the models again, and so on until we find the most accurate model. In the script you will see different transformations applied to the dataset (bucketizing, grouping observations with low frequency, creating new features, etc.). Some of them improved the model and others did not; nevertheless we kept them all, with a detailed explanation of our observations and conclusions. The actions performed might be triggered by insights from the data visualization section or by general knowledge of this field.

During the analysis we will refer to observations with a prolonged absence (> 5 hours) as positives or ones, while those with short absences are called negatives or zeros.
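As a hedged sketch of that labelling rule (the provided files already contain the 0/1 target, and the toy durations below are assumptions, not rows from the dataset):

```r
# Toy illustration: derive the binary target from absence duration in hours
hours <- c(2, 8, 0, 5, 16, 3)          # hypothetical absence durations
positive <- ifelse(hours > 5, 1, 0)    # 1 = prolonged absence (> 5 hours)
positive
```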


1. Ground preparation

1.1 Load libraries used during the analysis.

1.2 Load our data and set our working directory. Our data is separated into a training dataset and a test dataset.

rm(list = ls()) # Erase all variables from memory

#path<- "/Users/estevezmelgarejoramon/Desktop/ML2/FirstAssignment/Absenteeism"
#setwd(path)

training <- read.csv(file = file.path("Absenteeism_at_work_classification_training.csv"),  header = TRUE, dec = ".", sep = ";")
test <- read.csv(file = file.path("Absenteeism_at_work_classification_test.csv"),  header = TRUE, dec = ".", sep = ";")
testId <- read.csv(file = file.path("Absenteeism_at_work_classification_test.csv"),  header = TRUE, dec = ".", sep = ";") %>% select_at("ID")
set.seed(123)

1.3 Function Creation

We have created a series of functions that compute the models we will be using in this assignment, as well as a function that compares results among different well-known models to determine the most suitable one. Due to the size and composition of the dataset and the kind of analysis we need to do, we will test the following models: Linear Model, Lasso/Ridge (Glmnet), Decision Tree, XGBoost and Random Forest.

lm.model <- function(training_dataset) {
  
  # Create a training control configuration that applies a 5-fold cross validation
  train_control_config <- trainControl(method = "repeatedcv", 
                                       number = 5, 
                                       repeats = 1,
                                       returnResamp = "all",
                                       verboseIter = F,
                                       sampling = "up")
  
  # Fit a glm model to the input training data
  set.seed(123)
  this.model <- train(Absenteeism ~ .,
                      data = training_dataset, #the dataset is what will change 
                      method = "glm", 
                      metric = "Accuracy",
                      preProc = c("center", "scale"),
                      trControl=train_control_config)
  
  return(this.model)
}

Glmnet.model <- function(training_dataset) {
  trainControlLasso <- trainControl(method = "repeatedcv", 
                                    number = 5, 
                                    repeats = 1,
                                    returnResamp = "all",
                                    verboseIter = F,
                                    sampling = "up")
  
  gridGlmnet<- expand.grid(alpha = seq(0, 1, by = 0.01), 
                         lambda = seq(0.001, 0.05, by = 0.01)) 
  
  set.seed(123)  
  this.model <- train(Absenteeism ~ ., data = training_dataset, 
                           method = "glmnet", 
                           metric = "Accuracy",
                           preProc = c("center", "scale"),
                           trControl = trainControlLasso,
                           tuneGrid = gridGlmnet
                           ) 
  
  
  return(this.model)
}

Tree.model <-  function(training_dataset){
  trainControlTree <- trainControl(method = "repeatedcv", 
                                   number = 5, 
                                   repeats = 1,
                                   returnResamp = "all",
                                   verboseIter = F,
                                   sampling = "up")

  
  
  set.seed(123)  
  this.model <- train(Absenteeism ~ ., data = training_dataset, 
                     method = "rpart", 
                     metric = "Accuracy",
                     preProc = c("center", "scale"),
                     trControl = trainControlTree,
                     tuneLength = 200 # number of candidate cp values to try
  ) 
  return(this.model)
}

XGB.model <-  function(training_dataset){
  
  trainControlXGB <- trainControl(method = "repeatedcv", 
                                  number = 5, 
                                  repeats = 1,
                                  returnResamp = "all",
                                  verboseIter = F,
                                  allowParallel = TRUE,
                                  sampling = "up")
                                  
  registerDoMC(cores=4)
  set.seed(123)  
  this.model <- train(Absenteeism ~ ., data = training_dataset, 
                    method = "xgbTree", 
                    metric = "Accuracy",
                    preProc = c("center", "scale"),
                    trControl = trainControlXGB,
                    tuneLength = 20, # number of tuning-grid combinations to try
                    num.threads = 4
                    ) 
  
  return(this.model)
}

RandomForest.model <-  function(training_dataset){
  
  trainControlRandomForest <- trainControl(method = "repeatedcv", 
                                  number = 5, 
                                  repeats = 1,
                                  returnResamp = "all",
                                  verboseIter = F,
                                  allowParallel = T,
                                  sampling = "up")
                                  
  
  set.seed(123)  
  registerDoMC(cores=4)
  this.model <- train(Absenteeism ~ ., data = training_dataset, 
                      method = "rf", 
                      metric = "Accuracy",
                      preProc = c("center", "scale"),
                      trControl = trainControlRandomForest,
                      tuneLength = 50, # number of mtry values to try
                      num.threads = 4
  ) 
  
  return(this.model)
}


Model.compare <- function(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "None"){
  ## This function compares the accuracy of lm.model, Glmnet.model, Tree.model and RandomForest.model
  accuracies <- c(max(LmModel$results$Accuracy), max(GlmnetModel$results$Accuracy), 
                  max(TreeModel$results$Accuracy), max(RandomForestModel$results$Accuracy))
  
  models <- c("Linear", "Glmnet", "Tree", "RandomForest")
  treatment <- rep(TreatmentDescription, 4)
  df <- data.frame(Model = models,
                   Accuracy = accuracies,
                   Treatment = treatment)
  return(df)
}

2. Recoding and variable types

This is done for a better understanding of the data and to improve the visualization in the graphs.

Because we will do some recoding, we first combine both datasets so that we only run the process once and avoid having feature differences between the training and test datasets.

test$Absenteeism <- 0 # Test data has no target variable. 
dataset <- rbind(training, test)

dataset <- dataset[,!(names(dataset) %in% c("ID", "ID.Worker"))] # Removing ID variables. 

# We will recode variable names so that they fit better in plots
names(dataset) <- recode(names(dataset),
                         "Transportation.expense" = "TranspExp",
                         "Hit.target"  ="Hit",
                         "Social.smoker"  = "Smoker",
                         "Distance.from.Residence.to.Work" = "WorkDist",
                         "Disciplinary.failure" = "DiscFailure",
                         "Month.of.absence" = "MonthOfAbs",
                         "ID.Worker" = "WorkerID",
                         "Service.time" = "ServTime",
                         "Day.of.the.week"   = "WeekDay",
                         "Work.load.Average.day"  = "WorkLoadAvgDay",
                         "Social.drinker"  = "Drinker",
                         "Body.mass.index" = "BodyMassIndex",
                         "Reason.for.absence"  = "Reason")



dataset$Reason <- recode(dataset$Reason, 
                         '0'='InfParasDis', '1'='Neoplasms', '2'='DisOfBblood', '3'='Endoc&metDis', '4'='MentAndBehavDisor',
                         '5'='NervSys', '6'='EyeAndBdnexa', '7'='EarAndMast', '8'='CircSys', '9'='RespSys', '10'='DigSys', 
                         '11'='SkinAndSaubcutTiss', '12'='MuscuAndConnect',  '13'='Genitourinary', '14'='PregnanBirthAndPuerp', 
                         '15'='Perinatal', '16'='Congenital', '17'='AbnormalFindings', '18'='InjuryPoisoningAndOther',
                         '19'='MorbidityAndMortality', '21'='FactorsinfluencingHealth', '22'='patientFollowUp',
                         '23'='Consultation', '24'='BloodDonation', '25'='LabExamination', '26'='Unjustified',
                         '27'='Physiotherapy', '28'='Dental')

dataset$Seasons <- recode(dataset$Seasons,'1'='summer','2'='autumn','3'='winter','4'='spring')

dataset$Education <- recode(dataset$Education, '1'='highschool','2'='graduate','3'='postgraduate','4'='masterAndPhD')

dataset$WeekDay <- recode(dataset$WeekDay, '2'='Mon','3'='Tue','4'='Wed','5'='Thr', '6'='Fri')

summary(dataset)
##     Reason            MonthOfAbs       WeekDay            Seasons         
##  Length:740         Min.   : 0.000   Length:740         Length:740        
##  Class :character   1st Qu.: 3.000   Class :character   Class :character  
##  Mode  :character   Median : 6.000   Mode  :character   Mode  :character  
##                     Mean   : 6.324                                        
##                     3rd Qu.: 9.000                                        
##                     Max.   :12.000                                        
##    TranspExp        WorkDist        ServTime          Age       
##  Min.   :118.0   Min.   : 5.00   Min.   : 1.00   Min.   :27.00  
##  1st Qu.:179.0   1st Qu.:16.00   1st Qu.: 9.00   1st Qu.:31.00  
##  Median :225.0   Median :26.00   Median :13.00   Median :37.00  
##  Mean   :221.3   Mean   :29.63   Mean   :12.55   Mean   :36.45  
##  3rd Qu.:260.0   3rd Qu.:50.00   3rd Qu.:16.00   3rd Qu.:40.00  
##  Max.   :388.0   Max.   :52.00   Max.   :29.00   Max.   :58.00  
##  WorkLoadAvgDay       Hit          DiscFailure       Education        
##  Min.   :205.9   Min.   : 81.00   Min.   :0.00000   Length:740        
##  1st Qu.:244.4   1st Qu.: 93.00   1st Qu.:0.00000   Class :character  
##  Median :264.2   Median : 95.00   Median :0.00000   Mode  :character  
##  Mean   :271.5   Mean   : 94.59   Mean   :0.05405                     
##  3rd Qu.:294.2   3rd Qu.: 97.00   3rd Qu.:0.00000                     
##  Max.   :378.9   Max.   :100.00   Max.   :1.00000                     
##       Son           Drinker           Smoker             Pet        
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.000   Median :1.0000   Median :0.00000   Median :0.0000  
##  Mean   :1.019   Mean   :0.5676   Mean   :0.07297   Mean   :0.7459  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :1.0000   Max.   :1.00000   Max.   :8.0000  
##      Weight           Height      BodyMassIndex    Absenteeism    
##  Min.   : 56.00   Min.   :163.0   Min.   :19.00   Min.   :0.0000  
##  1st Qu.: 69.00   1st Qu.:169.0   1st Qu.:24.00   1st Qu.:0.0000  
##  Median : 83.00   Median :170.0   Median :25.00   Median :0.0000  
##  Mean   : 79.04   Mean   :172.1   Mean   :26.68   Mean   :0.2946  
##  3rd Qu.: 89.00   3rd Qu.:172.0   3rd Qu.:31.00   3rd Qu.:1.0000  
##  Max.   :108.00   Max.   :196.0   Max.   :38.00   Max.   :1.0000

Check variable types (factor or numeric).

We see that some columns need to be converted to factors. Booleans (0 or 1) are set as factors too.

# factor will convert the column to categorical. lapply applies the factor function to each of the columns in categorical_columns 
categorical_columns <- c('Reason', 'MonthOfAbs', 'WeekDay', 'Seasons', 'DiscFailure', 'Drinker', 'Smoker', 'Absenteeism', 'Education')
dataset[categorical_columns] <- lapply(dataset[categorical_columns], factor) 
summary(dataset)
##                    Reason      MonthOfAbs  WeekDay     Seasons   
##  Consultation         :149   3      : 87   Fri:144   autumn:192  
##  Dental               :112   2      : 72   Mon:161   spring:195  
##  Physiotherapy        : 69   10     : 71   Thr:125   summer:170  
##  Genitourinary        : 55   7      : 67   Tue:154   winter:183  
##  InfParasDis          : 43   5      : 64   Wed:156               
##  MorbidityAndMortality: 40   11     : 63                         
##  (Other)              :272   (Other):316                         
##    TranspExp        WorkDist        ServTime          Age       
##  Min.   :118.0   Min.   : 5.00   Min.   : 1.00   Min.   :27.00  
##  1st Qu.:179.0   1st Qu.:16.00   1st Qu.: 9.00   1st Qu.:31.00  
##  Median :225.0   Median :26.00   Median :13.00   Median :37.00  
##  Mean   :221.3   Mean   :29.63   Mean   :12.55   Mean   :36.45  
##  3rd Qu.:260.0   3rd Qu.:50.00   3rd Qu.:16.00   3rd Qu.:40.00  
##  Max.   :388.0   Max.   :52.00   Max.   :29.00   Max.   :58.00  
##                                                                 
##  WorkLoadAvgDay       Hit         DiscFailure        Education  
##  Min.   :205.9   Min.   : 81.00   0:700       graduate    : 46  
##  1st Qu.:244.4   1st Qu.: 93.00   1: 40       highschool  :611  
##  Median :264.2   Median : 95.00               masterAndPhD:  4  
##  Mean   :271.5   Mean   : 94.59               postgraduate: 79  
##  3rd Qu.:294.2   3rd Qu.: 97.00                                 
##  Max.   :378.9   Max.   :100.00                                 
##                                                                 
##       Son        Drinker Smoker       Pet             Weight      
##  Min.   :0.000   0:320   0:686   Min.   :0.0000   Min.   : 56.00  
##  1st Qu.:0.000   1:420   1: 54   1st Qu.:0.0000   1st Qu.: 69.00  
##  Median :1.000                   Median :0.0000   Median : 83.00  
##  Mean   :1.019                   Mean   :0.7459   Mean   : 79.04  
##  3rd Qu.:2.000                   3rd Qu.:1.0000   3rd Qu.: 89.00  
##  Max.   :4.000                   Max.   :8.0000   Max.   :108.00  
##                                                                   
##      Height      BodyMassIndex   Absenteeism
##  Min.   :163.0   Min.   :19.00   0:522      
##  1st Qu.:169.0   1st Qu.:24.00   1:218      
##  Median :170.0   Median :25.00              
##  Mean   :172.1   Mean   :26.68              
##  3rd Qu.:172.0   3rd Qu.:31.00              
##  Max.   :196.0   Max.   :38.00              
## 
str(dataset)
## 'data.frame':    740 obs. of  20 variables:
##  $ Reason        : Factor w/ 28 levels "AbnormalFindings",..: 28 14 5 9 5 22 18 22 20 20 ...
##  $ MonthOfAbs    : Factor w/ 13 levels "0","1","2","3",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ WeekDay       : Factor w/ 5 levels "Fri","Mon","Thr",..: 4 4 5 3 3 1 2 2 2 4 ...
##  $ Seasons       : Factor w/ 4 levels "autumn","spring",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ TranspExp     : int  289 118 179 279 289 361 155 235 260 260 ...
##  $ WorkDist      : int  36 13 51 5 36 52 12 11 50 50 ...
##  $ ServTime      : int  13 18 18 14 13 3 14 14 11 11 ...
##  $ Age           : int  33 50 38 39 33 28 34 37 36 36 ...
##  $ WorkLoadAvgDay: num  240 240 240 240 240 ...
##  $ Hit           : int  97 97 97 97 97 97 97 97 97 97 ...
##  $ DiscFailure   : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Education     : Factor w/ 4 levels "graduate","highschool",..: 2 2 2 2 2 2 2 4 2 2 ...
##  $ Son           : int  2 1 0 2 2 1 2 1 4 4 ...
##  $ Drinker       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Smoker        : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ Pet           : int  1 0 0 0 1 4 0 1 0 0 ...
##  $ Weight        : int  90 98 89 68 90 80 95 88 65 65 ...
##  $ Height        : int  172 178 170 168 172 172 196 172 168 168 ...
##  $ BodyMassIndex : int  30 31 31 24 30 27 25 29 23 23 ...
##  $ Absenteeism   : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 2 2 ...

Let's check for duplicates in the training data and, if there are any, remove them to avoid overfitting.

training <- dataset[1:593,]
test <- dataset[594:740,]
print("Number of rows duplicated :")
## [1] "Number of rows duplicated :"
nrow(training) - nrow(distinct(training))
## [1] 35
training <- distinct(training)

In the previous analysis we inferred that the data is complete. Just to double-check that there are no missing values or NAs:

colSums(is.na(dataset))
##         Reason     MonthOfAbs        WeekDay        Seasons      TranspExp 
##              0              0              0              0              0 
##       WorkDist       ServTime            Age WorkLoadAvgDay            Hit 
##              0              0              0              0              0 
##    DiscFailure      Education            Son        Drinker         Smoker 
##              0              0              0              0              0 
##            Pet         Weight         Height  BodyMassIndex    Absenteeism 
##              0              0              0              0              0

First Run of the Model

We will compute our models without any feature engineering and later compare them against every modification we make to our data.

training <- dataset[1:593,] # Careful: this re-slice restores the 35 duplicate rows removed above
test <- dataset[594:740,]

# Creating an empty df that will contain the comparisons of our models
comparisonDF <- data.frame(Model = character(), Accuracy = numeric(), Treatment = character())

# Fitting our models
LmModel <- lm.model(training) 
GlmnetModel <- Glmnet.model(training) 
TreeModel <- Tree.model(training)
RandomForestModel <- RandomForest.model(training)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "None"))
comparisonDF
##          Model  Accuracy Treatment
## 1       Linear 0.8246119      None
## 2       Glmnet 0.8532118      None
## 3         Tree 0.8364193      None
## 4 RandomForest 0.8297109      None

Glmnet is the model with the highest accuracy: 0.853 with minimal feature engineering.

3. Data visualisation

Before any feature engineering we will visualise our data to see if we can observe any interesting patterns.

3.1 Continuous Variables

Let's plot a histogram of all continuous variables to see their distributions.

training %>% 
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let's now create a "violin" plot of the continuous variables to compare the behaviour of the positive and negative populations in our training set. If a feature has a very similar distribution in both classes, we might consider removing it from the analysis.

# Violin plots
DataForViolinPLot <- training %>% 
                     keep(is.numeric) %>% 
                     cbind(training$Absenteeism) %>%
                     melt()
## Using training$Absenteeism as id variables
names(DataForViolinPLot)[1] <- "Absenteeism"
ggplot(DataForViolinPLot, aes(x = Absenteeism, y = value)) +
  facet_wrap(~ variable, scales = "free") +
  geom_violin()

In this plot we can see the distribution of the continuous variables for each target class. We would expect good explanatory variables to have different shapes, i.e. a different distribution for each class. We can see that variables such as Weight, for example, have nearly the same shape in both classes and are therefore not very promising. Nevertheless, we will later compute some statistical analysis to confirm this visual impression.

Now we will check the correlation among the continuous variables.

# Pearson correlation among the numeric features

correlation <- training %>% keep(is.numeric) %>% cor(method = "pearson")
corrplot(correlation, type = 'full', method = 'circle')

We can see that Weight and BodyMassIndex are correlated, which makes sense as BMI = kg/m^2. Another correlation we see is between Son and Pet and the travel distance: this also makes sense, as bigger houses are in the suburbs.
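A quick sanity check (with illustrative values taken from rows shown in the str(dataset) output above, not a computation on the full data) confirms that the BMI column is essentially derivable from Weight and Height, which explains the near-perfect correlation:

```r
# BMI = weight(kg) / height(m)^2; the dataset stores it rounded to an integer
weight_kg <- c(90, 68, 80)
height_cm <- c(172, 168, 172)
bmi <- weight_kg / (height_cm / 100)^2
round(bmi)  # matches the BodyMassIndex values 30, 24, 27 for these rows
```

Since one of the two is redundant, dropping either Weight or BodyMassIndex could be considered later.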

# Correlation between bodymass index and weight
print("Bodymass Index - weight pearson correlation: ")
## [1] "Bodymass Index - weight pearson correlation: "
correlation["BodyMassIndex", "Weight"]
## [1] 0.9092747

Let's box-plot all the continuous variables.

# Box plot of numeric variables
numVars<- dplyr::select_if(training, is.numeric)

for (i in unique(names(numVars))){
  dataForPlot <- numVars %>% select_at(i)
  plot <- ggplot(dataForPlot, aes_string(y = i)) +  stat_boxplot()
  show(plot)
}

We can see that there are some outliers in our data but nothing very extreme. This issue will be treated down the road.

3.2 Categorical Variables

First we count the number of unique values (categories) per feature

sapply(dataset %>% keep(is.factor), function(x){length(unique(x))})  
##      Reason  MonthOfAbs     WeekDay     Seasons DiscFailure   Education 
##          28          13           5           4           2           4 
##     Drinker      Smoker Absenteeism 
##           2           2           2

Observations:

- 13 months: this will need to be addressed later. We have to decide whether a transformation is necessary and evaluate the impact of leaving 3 observations with month 0.
- Too many reasons for absence.
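The month-0 check mentioned above can be sketched on a toy factor (the real call would be on `training$MonthOfAbs`, and the toy values below are assumptions):

```r
# Count observations carrying the invalid placeholder month "0"
months <- factor(c("0", "7", "7", "0", "3", "0"))   # toy values
sum(months == "0")
```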

In the following graphs we will compare the behaviour of the categorical variables between negative (0) and positive (1) values.

# plot of categorical variables 

FactorVars<- dplyr::select_if(training, is.factor)

for (i in unique(names(FactorVars))[names(FactorVars) != "Absenteeism"]){
  dataForPlot <- FactorVars %>% select_at(c(i, "Absenteeism"))
  plot <- ggplot(dataForPlot, aes_string(i, "Absenteeism")) + geom_count() + coord_flip()
  show(plot)
}

The first thing we see is that categories like masterAndPhD have very low frequency; maybe it makes sense to group them and keep only two classes, highschool & university.
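A hedged sketch of that grouping (not applied to the dataset at this point; the level names follow the Education recoding above):

```r
# Collapse the four education levels into two classes
edu  <- factor(c("highschool", "graduate", "masterAndPhD", "postgraduate"))
edu2 <- factor(ifelse(edu == "highschool", "highschool", "university"))
table(edu2)
```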

It also seems that there are too many reasons for absenteeism, some of them with very low frequency. We could create a group called “Others” to gather all low-frequency classes (less than 1%).

There are some interesting patterns, at least in the Drinker, DiscFailure, Seasons and WeekDay variables. In the following barplots we can observe the behaviour:

- Although the number of observations per season is evenly distributed, we observe a significantly higher number of positive values during winter.
- Although the number of observations per day of the week is similarly distributed, we observe significantly more positive values on Mondays and fewer on Thursdays.
- There are very few employees with disciplinary failures in the training set, and none of them has an absenteeism of more than 5 hours.
- Unlike smokers, 60% of the employees who drink have extended leaves.
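The drinker figure can also be checked numerically; on a toy 0/1 sample (the real call would use `training$Drinker` and `training$Absenteeism`), `prop.table` with `margin = 1` gives the share of positives within each drinker group:

```r
drinker     <- factor(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0))  # toy values
absenteeism <- factor(c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0))
# Row-wise proportions: per drinker group, the share of 0s and 1s
prop.table(table(drinker, absenteeism), margin = 1)
```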

true_values <- training[training$Absenteeism == 1,]

ggplot(data = true_values, aes(x=Seasons)) + geom_bar()

ggplot(data = true_values, aes(x=WeekDay)) + geom_bar() + ggtitle("Monday, bloody Monday....")

ggplot(data = true_values, aes(x=Drinker)) + geom_bar()

ggplot(data = true_values, aes(x=DiscFailure)) + geom_bar()

Finally, let's focus on the frequencies of Reason and Education.

reasonFreq <- freq(training$Reason, digits = 5, valid = F) %>% tibble::rownames_to_column("Reason") 
EducatioFreq <- freq(training$Education, digits = 5, valid = F) %>% tibble::rownames_to_column("Education") 

reasonFreq
##                      Reason   n        %
## 1          AbnormalFindings   1  0.16863
## 2             BloodDonation   3  0.50590
## 3                   CircSys   6  1.01180
## 4                Congenital   3  0.50590
## 5              Consultation 110 18.54975
## 6                    Dental  96 16.18887
## 7                    DigSys  23  3.87858
## 8               DisOfBblood   1  0.16863
## 9                EarAndMast  13  2.19224
## 10             Endoc&metDis   1  0.16863
## 11             EyeAndBdnexa   8  1.34907
## 12 FactorsinfluencingHealth   5  0.84317
## 13            Genitourinary  47  7.92580
## 14              InfParasDis  32  5.39629
## 15  InjuryPoisoningAndOther  17  2.86678
## 16           LabExamination  28  4.72175
## 17        MentAndBehavDisor   1  0.16863
## 18    MorbidityAndMortality  28  4.72175
## 19          MuscuAndConnect   8  1.34907
## 20                Neoplasms  14  2.36088
## 21                  NervSys   3  0.50590
## 22          patientFollowUp  26  4.38449
## 23                Perinatal   1  0.16863
## 24            Physiotherapy  54  9.10624
## 25     PregnanBirthAndPuerp  15  2.52951
## 26                  RespSys   4  0.67454
## 27       SkinAndSaubcutTiss  21  3.54132
## 28              Unjustified  24  4.04722
EducatioFreq
##      Education   n        %
## 1     graduate  36  6.07083
## 2   highschool 489 82.46206
## 3 masterAndPhD   3  0.50590
## 4 postgraduate  65 10.96121
ggplot(reasonFreq , aes(x = reorder(Reason, - `%`), y = `%`)) +    
  geom_col() +
  ylab("%") +
  xlab("Reason") +
  coord_flip()

ggplot(EducatioFreq , aes(x = reorder(Education, - `%`), y = `%`)) +    
  geom_col() +
  ylab("%") +
  xlab("Education") +
  coord_flip()

4. Outliers

As we saw in the box plots of the continuous variables, there are some outliers that might be affecting our results. Let's remove them:

# Note: training_no_outliers is rebuilt from the full `training` set on every
# iteration, so after the loop only the outliers of the LAST numeric column
# (BodyMassIndex, which has none) have actually been removed. To drop them all,
# the filter would have to be applied cumulatively.
for (col in names(training)) {
  if (is.numeric(training[[col]]) && col != "Absenteeism"){
    to_remove <- boxplot.stats(training[[col]], coef = 4)$out
    training_no_outliers <- training[!training[[col]] %in% to_remove, ]
    print(col)
    print(to_remove)
    print(ggplot(training_no_outliers, aes_string(y=col))+ geom_boxplot(width=0.1))
  }
}
## [1] "TranspExp"
## integer(0)

## [1] "WorkDist"
## integer(0)

## [1] "ServTime"
## integer(0)

## [1] "Age"
## integer(0)

## [1] "WorkLoadAvgDay"
## numeric(0)

## [1] "Hit"
## integer(0)

## [1] "Son"
## integer(0)

## [1] "Pet"
## [1] 8 8 8 8

## [1] "Weight"
## integer(0)

## [1] "Height"
##  [1] 196 185 196 196 196 196 196 185 196 196 196 196 196 196 196 185 196
## [18] 185 196 185 196 196 196 196 196 196 196

## [1] "BodyMassIndex"
## integer(0)

As we can see, for a coefficient of 4, only number of pets = 8 and height = 196, 185 are considered outliers. In my opinion these values are not very extreme, especially height = 196, 185. Nevertheless, we will test our models without these outliers.

# Fitting our models without outliers

LmModel <- lm.model(training_no_outliers) 
GlmnetModel <- Glmnet.model(training_no_outliers) 
TreeModel <- Tree.model(training_no_outliers)
RandomForestModel <- RandomForest.model(training_no_outliers)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "RemovingOutliersCoef4"))
comparisonDF
##          Model  Accuracy             Treatment
## 1       Linear 0.8246119                  None
## 2       Glmnet 0.8532118                  None
## 3         Tree 0.8364193                  None
## 4 RandomForest 0.8297109                  None
## 5       Linear 0.8246119 RemovingOutliersCoef4
## 6       Glmnet 0.8532118 RemovingOutliersCoef4
## 7         Tree 0.8364193 RemovingOutliersCoef4
## 8 RandomForest 0.8297109 RemovingOutliersCoef4
rm(training_no_outliers)

It seems that removing them has no influence on accuracy. Considering that our training set has fewer than 600 observations and there are no significant outliers that can be attributed to a mistake while loading the information, we will not consider applying factors lower than 4.
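How the coef argument behaves can be seen on a toy vector (illustrative values, not the training data): observations beyond coef times the IQR from the box hinges are flagged, so a larger coefficient flags fewer points.

```r
x <- c(1:10, 50)                      # one clear extreme value
boxplot.stats(x, coef = 1.5)$out      # the default fence flags 50
boxplot.stats(x, coef = 10)$out       # a much wider fence flags nothing
```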

Given the negligible improvement after removing the outliers, we will start the feature engineering keeping them and continue monitoring them.

5. Feature Engineering

5.1 Bucketize some continuous features:

train_buckson <- training
train_buckson$Son <-as.factor(ifelse(train_buckson$Son == 0, "No", "Yes"))

Run the models with bucketized sons.

LmModel <- lm.model(train_buckson) 
GlmnetModel <- Glmnet.model(train_buckson) 
TreeModel <- Tree.model(train_buckson)
RandomForestModel <- RandomForest.model(train_buckson)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Sons"))
comparisonDF
##           Model  Accuracy             Treatment
## 1        Linear 0.8246119                  None
## 2        Glmnet 0.8532118                  None
## 3          Tree 0.8364193                  None
## 4  RandomForest 0.8297109                  None
## 5        Linear 0.8246119 RemovingOutliersCoef4
## 6        Glmnet 0.8532118 RemovingOutliersCoef4
## 7          Tree 0.8364193 RemovingOutliersCoef4
## 8  RandomForest 0.8297109 RemovingOutliersCoef4
## 9        Linear 0.8195841      bucketizing Sons
## 10       Glmnet 0.8532260      bucketizing Sons
## 11         Tree 0.8364193      bucketizing Sons
## 12 RandomForest 0.8296966      bucketizing Sons

We observe a slight improvement on Glmnet performance so we will keep the change.

training$Son <-as.factor(ifelse(training$Son == 0, "No", "Yes"))

Let's now try the same with pets.

train_buckpet <- training
train_buckpet$Pet <-as.factor(ifelse(train_buckpet$Pet == 0, "No", "Yes"))
  
LmModel <- lm.model(train_buckpet) 
GlmnetModel <- Glmnet.model(train_buckpet) 
TreeModel <- Tree.model(train_buckpet)
RandomForestModel <- RandomForest.model(train_buckpet)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Pets"))
comparisonDF
##           Model  Accuracy             Treatment
## 1        Linear 0.8246119                  None
## 2        Glmnet 0.8532118                  None
## 3          Tree 0.8364193                  None
## 4  RandomForest 0.8297109                  None
## 5        Linear 0.8246119 RemovingOutliersCoef4
## 6        Glmnet 0.8532118 RemovingOutliersCoef4
## 7          Tree 0.8364193 RemovingOutliersCoef4
## 8  RandomForest 0.8297109 RemovingOutliersCoef4
## 9        Linear 0.8195841      bucketizing Sons
## 10       Glmnet 0.8532260      bucketizing Sons
## 11         Tree 0.8364193      bucketizing Sons
## 12 RandomForest 0.8296966      bucketizing Sons
## 13       Linear 0.8178749      bucketizing Pets
## 14       Glmnet 0.8532260      bucketizing Pets
## 15         Tree 0.8364193      bucketizing Pets
## 16 RandomForest 0.8296824      bucketizing Pets
rm(train_buckpet)

No improvement here, so we discard this change.

5.2 Let’s group reasons with a frequency of 1% or less

# Group reasons with a frequency of 1% or less into an "OTHERS" class
reasonsToGroup <- reasonFreq[reasonFreq$`%` <= 1, "Reason"]
training_buckreasson <- training
training_buckreasson$Reason <- as.character(training_buckreasson$Reason) # We have to convert to character and then back
training_buckreasson[training_buckreasson$Reason %in% reasonsToGroup, "Reason"] <- "OTHERS"
training_buckreasson$Reason <- as.factor(training_buckreasson$Reason)
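As an aside, the character round-trip above can be wrapped in a small base-R helper. This is only a sketch; `lump_rare_levels` is our own hypothetical function, not part of the analysis pipeline:

```r
# Merge factor levels whose relative frequency is at or below min_prop
# into a single "OTHERS" level (same idea as the chunk above)
lump_rare_levels <- function(f, min_prop = 0.01, other = "OTHERS") {
  f <- as.factor(f)
  tab <- prop.table(table(f))
  rare <- names(tab)[tab <= min_prop]
  levels(f)[levels(f) %in% rare] <- other  # duplicated levels are merged
  f
}
```

With `training_buckreasson$Reason <- lump_rare_levels(training_buckreasson$Reason)` this would replace the conversion above in a single call.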

freq(training_buckreasson$Reason, digits = 5, valid = F)
##                           n        %
## CircSys                   6  1.01180
## Consultation            110 18.54975
## Dental                   96 16.18887
## DigSys                   23  3.87858
## EarAndMast               13  2.19224
## EyeAndBdnexa              8  1.34907
## Genitourinary            47  7.92580
## InfParasDis              32  5.39629
## InjuryPoisoningAndOther  17  2.86678
## LabExamination           28  4.72175
## MorbidityAndMortality    28  4.72175
## MuscuAndConnect           8  1.34907
## Neoplasms                14  2.36088
## OTHERS                   23  3.87858
## patientFollowUp          26  4.38449
## Physiotherapy            54  9.10624
## PregnanBirthAndPuerp     15  2.52951
## SkinAndSaubcutTiss       21  3.54132
## Unjustified              24  4.04722
ReasonFrec <- freq(training_buckreasson$Reason, digits = 5, valid = F)
ReasonFrec <- tibble::rownames_to_column(ReasonFrec, "Reason") # freq() leaves the reasons as rownames, so we turn them into a column for ggplot
ReasonFrec
##                     Reason   n        %
## 1                  CircSys   6  1.01180
## 2             Consultation 110 18.54975
## 3                   Dental  96 16.18887
## 4                   DigSys  23  3.87858
## 5               EarAndMast  13  2.19224
## 6             EyeAndBdnexa   8  1.34907
## 7            Genitourinary  47  7.92580
## 8              InfParasDis  32  5.39629
## 9  InjuryPoisoningAndOther  17  2.86678
## 10          LabExamination  28  4.72175
## 11   MorbidityAndMortality  28  4.72175
## 12         MuscuAndConnect   8  1.34907
## 13               Neoplasms  14  2.36088
## 14                  OTHERS  23  3.87858
## 15         patientFollowUp  26  4.38449
## 16           Physiotherapy  54  9.10624
## 17    PregnanBirthAndPuerp  15  2.52951
## 18      SkinAndSaubcutTiss  21  3.54132
## 19             Unjustified  24  4.04722
ggplot(ReasonFrec , aes(x = reorder(Reason, - `%`), y = `%`)) +    
  geom_col() +
  ylab("%") +
  xlab("Reason") +
  coord_flip() + 
  ggtitle("Reasons with 'OTHERS' group")

LmModel <- lm.model(training_buckreasson) 
GlmnetModel <- Glmnet.model(training_buckreasson) 
TreeModel <- Tree.model(training_buckreasson)
RandomForestModel <- RandomForest.model(training_buckreasson)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Reassons"))
comparisonDF
##           Model  Accuracy             Treatment
## 1        Linear 0.8246119                  None
## 2        Glmnet 0.8532118                  None
## 3          Tree 0.8364193                  None
## 4  RandomForest 0.8297109                  None
## 5        Linear 0.8246119 RemovingOutliersCoef4
## 6        Glmnet 0.8532118 RemovingOutliersCoef4
## 7          Tree 0.8364193 RemovingOutliersCoef4
## 8  RandomForest 0.8297109 RemovingOutliersCoef4
## 9        Linear 0.8195841      bucketizing Sons
## 10       Glmnet 0.8532260      bucketizing Sons
## 11         Tree 0.8364193      bucketizing Sons
## 12 RandomForest 0.8296966      bucketizing Sons
## 13       Linear 0.8178749      bucketizing Pets
## 14       Glmnet 0.8532260      bucketizing Pets
## 15         Tree 0.8364193      bucketizing Pets
## 16 RandomForest 0.8296824      bucketizing Pets
## 17       Linear 0.8195983  bucketizing Reassons
## 18       Glmnet 0.8481698  bucketizing Reassons
## 19         Tree 0.8364193  bucketizing Reassons
## 20 RandomForest 0.8279447  bucketizing Reassons
rm(training_buckreasson)

Our best model, Glmnet, loses accuracy with this grouping, so we decide not to keep the change.

5.3 Groupby Education

training_groupEd <- training
training_groupEd$Higher_Education <- as.factor(recode(training_groupEd$Education, 'graduate'='Yes','postgraduate'='Yes','masterAndPhD'='Yes','highschool'='No'))
training_groupEd %>% 
  ggplot(aes(x = Higher_Education)) + geom_bar()

Fitting our models with the grouped education feature:

# Fitting our models after grouping education

LmModel <- lm.model(training_groupEd) 
GlmnetModel <- Glmnet.model(training_groupEd) 
TreeModel <- Tree.model(training_groupEd)
RandomForestModel <- RandomForest.model(training_groupEd)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "bucketizing Education"))
comparisonDF
##           Model  Accuracy             Treatment
## 1        Linear 0.8246119                  None
## 2        Glmnet 0.8532118                  None
## 3          Tree 0.8364193                  None
## 4  RandomForest 0.8297109                  None
## 5        Linear 0.8246119 RemovingOutliersCoef4
## 6        Glmnet 0.8532118 RemovingOutliersCoef4
## 7          Tree 0.8364193 RemovingOutliersCoef4
## 8  RandomForest 0.8297109 RemovingOutliersCoef4
## 9        Linear 0.8195841      bucketizing Sons
## 10       Glmnet 0.8532260      bucketizing Sons
## 11         Tree 0.8364193      bucketizing Sons
## 12 RandomForest 0.8296966      bucketizing Sons
## 13       Linear 0.8178749      bucketizing Pets
## 14       Glmnet 0.8532260      bucketizing Pets
## 15         Tree 0.8364193      bucketizing Pets
## 16 RandomForest 0.8296824      bucketizing Pets
## 17       Linear 0.8195983  bucketizing Reassons
## 18       Glmnet 0.8481698  bucketizing Reassons
## 19         Tree 0.8364193  bucketizing Reassons
## 20 RandomForest 0.8279447  bucketizing Reassons
## 21       Linear 0.8195841 bucketizing Education
## 22       Glmnet 0.8532260 bucketizing Education
## 23         Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education

Accuracy for our best model, Glmnet, is unchanged and the grouped feature is simpler, so we keep the change and move on!

training <- training_groupEd
rm(training_groupEd)

5.4 New features

# day next to weekend (Monday or Friday)
training$MonOrFri <- as.factor(recode(training$WeekDay, '1' = 'Yes', '2' = 'No', '3' = 'No', '4'='No','5'='Yes','6'='Yes'))

# avg. mile transportation cost 
training$Transp_avg_cost <- as.numeric(training$TranspExp/training$WorkDist)

# Log of distance to work
training$log_dist <-log(training$WorkDist)
# Fitting our models after new features

LmModel <- lm.model(training) 
GlmnetModel <- Glmnet.model(training) 
TreeModel <- Tree.model(training)
RandomForestModel <- RandomForest.model(training)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "New features"))
comparisonDF
##           Model  Accuracy             Treatment
## 1        Linear 0.8246119                  None
## 2        Glmnet 0.8532118                  None
## 3          Tree 0.8364193                  None
## 4  RandomForest 0.8297109                  None
## 5        Linear 0.8246119 RemovingOutliersCoef4
## 6        Glmnet 0.8532118 RemovingOutliersCoef4
## 7          Tree 0.8364193 RemovingOutliersCoef4
## 8  RandomForest 0.8297109 RemovingOutliersCoef4
## 9        Linear 0.8195841      bucketizing Sons
## 10       Glmnet 0.8532260      bucketizing Sons
## 11         Tree 0.8364193      bucketizing Sons
## 12 RandomForest 0.8296966      bucketizing Sons
## 13       Linear 0.8178749      bucketizing Pets
## 14       Glmnet 0.8532260      bucketizing Pets
## 15         Tree 0.8364193      bucketizing Pets
## 16 RandomForest 0.8296824      bucketizing Pets
## 17       Linear 0.8195983  bucketizing Reassons
## 18       Glmnet 0.8481698  bucketizing Reassons
## 19         Tree 0.8364193  bucketizing Reassons
## 20 RandomForest 0.8279447  bucketizing Reassons
## 21       Linear 0.8195841 bucketizing Education
## 22       Glmnet 0.8532260 bucketizing Education
## 23         Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25       Linear 0.8195983          New features
## 26       Glmnet 0.8515596          New features
## 27         Tree 0.8364193          New features
## 28 RandomForest 0.8245834          New features

No real improvement, so we remove these features and move on.

# Removing them
training$MonOrFri <- NULL
training$Transp_avg_cost <- NULL
training$log_dist <- NULL
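A side note on the two derived features above: `TranspExp/WorkDist` and `log(WorkDist)` would produce `Inf`/`-Inf` (or `NaN`) for any observation with `WorkDist` equal to 0. Guarded versions could look like this (hypothetical helpers, a sketch only):

```r
# Return NA instead of Inf/NaN when the denominator is not positive
safe_ratio <- function(num, den) ifelse(den > 0, num / den, NA_real_)

# Return NA instead of -Inf/NaN when x is not positive
safe_log <- function(x) ifelse(x > 0, log(x), NA_real_)
```

These would be drop-in replacements, e.g. `safe_ratio(training$TranspExp, training$WorkDist)`.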

5.5 More bucketing

training_morebuck <- training
# Note: .bincode returns NA for BMI values below 18.5 or above 39.99
training_morebuck$BodyMassIndex <- .bincode(training_morebuck$BodyMassIndex, c(18.5, 24.9, 29.99, 34.9, 39.99), right = TRUE, include.lowest = TRUE)

training_morebuck$BodyMassIndex <- as.factor(recode(training_morebuck$BodyMassIndex, '1' = "Normal", '2' = "Overweight", '3' = "Obesity", '4' = "ExtremeObesity"))

# Age
training_morebuck$Age <- as.factor(.bincode(training_morebuck$Age, c(18, 30, 45, 60), right = TRUE, include.lowest = TRUE))
# Fitting our models after new features

LmModel <- lm.model(training_morebuck) 
GlmnetModel <- Glmnet.model(training_morebuck) 
TreeModel <- Tree.model(training_morebuck)
RandomForestModel <- RandomForest.model(training_morebuck)

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "More bucketising"))
comparisonDF
##           Model  Accuracy             Treatment
## 1        Linear 0.8246119                  None
## 2        Glmnet 0.8532118                  None
## 3          Tree 0.8364193                  None
## 4  RandomForest 0.8297109                  None
## 5        Linear 0.8246119 RemovingOutliersCoef4
## 6        Glmnet 0.8532118 RemovingOutliersCoef4
## 7          Tree 0.8364193 RemovingOutliersCoef4
## 8  RandomForest 0.8297109 RemovingOutliersCoef4
## 9        Linear 0.8195841      bucketizing Sons
## 10       Glmnet 0.8532260      bucketizing Sons
## 11         Tree 0.8364193      bucketizing Sons
## 12 RandomForest 0.8296966      bucketizing Sons
## 13       Linear 0.8178749      bucketizing Pets
## 14       Glmnet 0.8532260      bucketizing Pets
## 15         Tree 0.8364193      bucketizing Pets
## 16 RandomForest 0.8296824      bucketizing Pets
## 17       Linear 0.8195983  bucketizing Reassons
## 18       Glmnet 0.8481698  bucketizing Reassons
## 19         Tree 0.8364193  bucketizing Reassons
## 20 RandomForest 0.8279447  bucketizing Reassons
## 21       Linear 0.8195841 bucketizing Education
## 22       Glmnet 0.8532260 bucketizing Education
## 23         Tree 0.8364193 bucketizing Education
## 24 RandomForest 0.8279163 bucketizing Education
## 25       Linear 0.8195983          New features
## 26       Glmnet 0.8515596          New features
## 27         Tree 0.8364193          New features
## 28 RandomForest 0.8245834          New features
## 29       Linear 0.8212648      More bucketising
## 30       Glmnet 0.8532830      More bucketising
## 31         Tree 0.8364193      More bucketising
## 32 RandomForest 0.8314200      More bucketising

Only a marginal change for our best model, so we discard this treatment as well.

5.6 Remove variables

Remove features that are not significant, based on experience in this field and the observations during section 3:

  • MonthOfAbs: we keep Seasons instead
  • DiscFailure: too unbalanced
  • Education: replaced by Higher_Education
  • Weight and Height: their information is already captured by BodyMassIndex

light_train <- training[,!(names(training) %in% c("MonthOfAbs", "DiscFailure", "Education", "Weight", "Height"))]

Run the models with the new combination of features:

LmModel <- lm.model(light_train) 
GlmnetModel <- Glmnet.model(light_train) 
TreeModel <- Tree.model(light_train)
RandomForestModel <- RandomForest.model(light_train)
## note: only 45 unique complexity parameters in default grid. Truncating the grid to 45 .
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "previous FE + removing features, no FSel"))
comparisonDF
##           Model  Accuracy                                Treatment
## 1        Linear 0.8246119                                     None
## 2        Glmnet 0.8532118                                     None
## 3          Tree 0.8364193                                     None
## 4  RandomForest 0.8297109                                     None
## 5        Linear 0.8246119                    RemovingOutliersCoef4
## 6        Glmnet 0.8532118                    RemovingOutliersCoef4
## 7          Tree 0.8364193                    RemovingOutliersCoef4
## 8  RandomForest 0.8297109                    RemovingOutliersCoef4
## 9        Linear 0.8195841                         bucketizing Sons
## 10       Glmnet 0.8532260                         bucketizing Sons
## 11         Tree 0.8364193                         bucketizing Sons
## 12 RandomForest 0.8296966                         bucketizing Sons
## 13       Linear 0.8178749                         bucketizing Pets
## 14       Glmnet 0.8532260                         bucketizing Pets
## 15         Tree 0.8364193                         bucketizing Pets
## 16 RandomForest 0.8296824                         bucketizing Pets
## 17       Linear 0.8195983                     bucketizing Reassons
## 18       Glmnet 0.8481698                     bucketizing Reassons
## 19         Tree 0.8364193                     bucketizing Reassons
## 20 RandomForest 0.8279447                     bucketizing Reassons
## 21       Linear 0.8195841                    bucketizing Education
## 22       Glmnet 0.8532260                    bucketizing Education
## 23         Tree 0.8364193                    bucketizing Education
## 24 RandomForest 0.8279163                    bucketizing Education
## 25       Linear 0.8195983                             New features
## 26       Glmnet 0.8515596                             New features
## 27         Tree 0.8364193                             New features
## 28 RandomForest 0.8245834                             New features
## 29       Linear 0.8212648                         More bucketising
## 30       Glmnet 0.8532830                         More bucketising
## 31         Tree 0.8364193                         More bucketising
## 32 RandomForest 0.8314200                         More bucketising
## 33       Linear 0.8448084 previous FE + removing features, no FSel
## 34       Glmnet 0.8566016 previous FE + removing features, no FSel
## 35         Tree 0.8364193 previous FE + removing features, no FSel
## 36 RandomForest 0.8364193 previous FE + removing features, no FSel

It seems that by removing these variables we actually gain some accuracy, so we keep the change.

training <- light_train

6. Feature Importance

For this analysis we considered all the variables.

6.1 Chi-squared

training_raw <- dataset[1:593,]


chisquared <- data.frame(chi.squared(Absenteeism~., training_raw[names(training_raw)]))
chisquared$features <- rownames(chisquared)

# Plot the result, and remove those below the 1st quartile (aggressive)
par(mfrow=c(1,2))
boxplot(chisquared$attr_importance)
bp.stats <- boxplot.stats(chisquared$attr_importance)$stats   # Get the statistics from the boxplot

chisquared.threshold = bp.stats[2]  # This element represents the 1st quartile (more on: https://www.math.ucla.edu/~anderson/rw1001/library/base/html/boxplot.stats.html).
text(y = bp.stats, labels = bp.stats, x = 1.3, cex=0.7)
barplot(sort(chisquared$attr_importance), names.arg = chisquared[order(chisquared$attr_importance),]$features, cex.names = 0.6, las=2, horiz = T)
abline(v=chisquared.threshold, col='red')  # Draw a red line at the 1st quartile

dev.off()
## null device 
##           1
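For intuition, what `FSelector::chi.squared()` scores per feature is the strength of the chi-squared association between that feature and the target. A minimal base-R illustration on a built-in dataset (not our data):

```r
# Chi-squared test of association between two categorical variables;
# a larger statistic (smaller p-value) means a stronger association
tab <- table(mtcars$cyl, mtcars$am)
res <- suppressWarnings(chisq.test(tab))  # warnings: small expected counts
res$statistic
res$p.value
```

`chi.squared()` does this per feature and rescales the result to [0, 1], which is the `attr_importance` we threshold below.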
# The threshold (red line) can be moved to check whether results improve

# Determine what features to remove from the dataset.
features_to_remove <- as.character(chisquared[chisquared$attr_importance <= chisquared.threshold, "features"])
chi_squared_model = lm.model(training_raw[!names(training_raw) %in% features_to_remove])
confusionMatrix(chi_squared_model, "none")
## Cross-Validated (5 fold, repeated 1 times) Confusion Matrix 
## 
## (entries are un-normalized aggregated counts)
##  
##           Reference
## Prediction   0   1
##          0 311  32
##          1  64 186
##                             
##  Accuracy (average) : 0.8381
LmModel <- lm.model(training_raw[!names(training_raw) %in% features_to_remove]) 
GlmnetModel <- Glmnet.model(training_raw[!names(training_raw) %in% features_to_remove]) 
TreeModel <- Tree.model(training_raw[!names(training_raw) %in% features_to_remove])
RandomForestModel <- RandomForest.model(training_raw[!names(training_raw) %in% features_to_remove])

comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "with chisquared FS"))
comparisonDF
##           Model  Accuracy                                Treatment
## 1        Linear 0.8246119                                     None
## 2        Glmnet 0.8532118                                     None
## 3          Tree 0.8364193                                     None
## 4  RandomForest 0.8297109                                     None
## 5        Linear 0.8246119                    RemovingOutliersCoef4
## 6        Glmnet 0.8532118                    RemovingOutliersCoef4
## 7          Tree 0.8364193                    RemovingOutliersCoef4
## 8  RandomForest 0.8297109                    RemovingOutliersCoef4
## 9        Linear 0.8195841                         bucketizing Sons
## 10       Glmnet 0.8532260                         bucketizing Sons
## 11         Tree 0.8364193                         bucketizing Sons
## 12 RandomForest 0.8296966                         bucketizing Sons
## 13       Linear 0.8178749                         bucketizing Pets
## 14       Glmnet 0.8532260                         bucketizing Pets
## 15         Tree 0.8364193                         bucketizing Pets
## 16 RandomForest 0.8296824                         bucketizing Pets
## 17       Linear 0.8195983                     bucketizing Reassons
## 18       Glmnet 0.8481698                     bucketizing Reassons
## 19         Tree 0.8364193                     bucketizing Reassons
## 20 RandomForest 0.8279447                     bucketizing Reassons
## 21       Linear 0.8195841                    bucketizing Education
## 22       Glmnet 0.8532260                    bucketizing Education
## 23         Tree 0.8364193                    bucketizing Education
## 24 RandomForest 0.8279163                    bucketizing Education
## 25       Linear 0.8195983                             New features
## 26       Glmnet 0.8515596                             New features
## 27         Tree 0.8364193                             New features
## 28 RandomForest 0.8245834                             New features
## 29       Linear 0.8212648                         More bucketising
## 30       Glmnet 0.8532830                         More bucketising
## 31         Tree 0.8364193                         More bucketising
## 32 RandomForest 0.8314200                         More bucketising
## 33       Linear 0.8448084 previous FE + removing features, no FSel
## 34       Glmnet 0.8566016 previous FE + removing features, no FSel
## 35         Tree 0.8364193 previous FE + removing features, no FSel
## 36 RandomForest 0.8364193 previous FE + removing features, no FSel
## 37       Linear 0.8380715                       with chisquared FS
## 38       Glmnet 0.8582823                       with chisquared FS
## 39         Tree 0.8364193                       with chisquared FS
## 40 RandomForest 0.8381427                       with chisquared FS

Good results when fitting the models with the variables selected by the chi-squared test. We tried moving the threshold to different values with no significant change in the results, so we set it back to zero.

6.2 Information Gain Selection

weights<- data.frame(information.gain(Absenteeism~., training_raw))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
##                attr_importance        feature
## Reason             0.309119821         Reason
## WeekDay            0.024079811        WeekDay
## DiscFailure        0.023961909    DiscFailure
## TranspExp          0.022994975      TranspExp
## Son                0.019335481            Son
## MonthOfAbs         0.017304305     MonthOfAbs
## Height             0.016319739         Height
## Seasons            0.004755832        Seasons
## Smoker             0.003853220         Smoker
## Education          0.002718730      Education
## Drinker            0.002333402        Drinker
## WorkDist           0.000000000       WorkDist
## ServTime           0.000000000       ServTime
## Age                0.000000000            Age
## WorkLoadAvgDay     0.000000000 WorkLoadAvgDay
## Hit                0.000000000            Hit
## Pet                0.000000000            Pet
## Weight             0.000000000         Weight
## BodyMassIndex      0.000000000  BodyMassIndex
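For intuition, information gain is the reduction in the target's entropy once a feature is known. A minimal base-R sketch for a categorical feature (our own hypothetical helpers; `FSelector::information.gain` additionally discretizes numeric features):

```r
# Shannon entropy (base 2) of a discrete vector
entropy <- function(x) {
  p <- prop.table(table(x))
  -sum(p * log2(p))
}

# Information gain of `target` given a categorical `feature`:
# H(target) minus the frequency-weighted entropy within each feature level
info_gain <- function(feature, target) {
  cond <- sum(prop.table(table(feature)) * tapply(target, feature, entropy))
  entropy(target) - cond
}
```

A feature that fully determines the target has gain equal to the target's entropy; an uninformative one has gain 0, which matches the zeros in the table above.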
information_gain_features <- weights$feature[weights$attr_importance >= 0.015]
information_gain_features_to_remove <- weights$feature[weights$attr_importance < 0.015]
training_gm <- training_raw[,!names(training_raw) %in% information_gain_features_to_remove]
information_gain_model = lm.model(training_gm)
confusionMatrix(information_gain_model, 'none')
## Cross-Validated (5 fold, repeated 1 times) Confusion Matrix 
## 
## (entries are un-normalized aggregated counts)
##  
##           Reference
## Prediction   0   1
##          0 311  32
##          1  64 186
##                             
##  Accuracy (average) : 0.8381
LmModel <- lm.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove]) 
GlmnetModel <- Glmnet.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove]) 
TreeModel <- Tree.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove])
RandomForestModel <- RandomForest.model(training_raw[,!names(training_raw) %in% information_gain_features_to_remove])
## note: only 46 unique complexity parameters in default grid. Truncating the grid to 46 .
comparisonDF <- rbind(comparisonDF, Model.compare(LmModel, GlmnetModel, TreeModel, RandomForestModel, TreatmentDescription = "with InformationGain FS"))
comparisonDF
##           Model  Accuracy                                Treatment
## 1        Linear 0.8246119                                     None
## 2        Glmnet 0.8532118                                     None
## 3          Tree 0.8364193                                     None
## 4  RandomForest 0.8297109                                     None
## 5        Linear 0.8246119                    RemovingOutliersCoef4
## 6        Glmnet 0.8532118                    RemovingOutliersCoef4
## 7          Tree 0.8364193                    RemovingOutliersCoef4
## 8  RandomForest 0.8297109                    RemovingOutliersCoef4
## 9        Linear 0.8195841                         bucketizing Sons
## 10       Glmnet 0.8532260                         bucketizing Sons
## 11         Tree 0.8364193                         bucketizing Sons
## 12 RandomForest 0.8296966                         bucketizing Sons
## 13       Linear 0.8178749                         bucketizing Pets
## 14       Glmnet 0.8532260                         bucketizing Pets
## 15         Tree 0.8364193                         bucketizing Pets
## 16 RandomForest 0.8296824                         bucketizing Pets
## 17       Linear 0.8195983                     bucketizing Reassons
## 18       Glmnet 0.8481698                     bucketizing Reassons
## 19         Tree 0.8364193                     bucketizing Reassons
## 20 RandomForest 0.8279447                     bucketizing Reassons
## 21       Linear 0.8195841                    bucketizing Education
## 22       Glmnet 0.8532260                    bucketizing Education
## 23         Tree 0.8364193                    bucketizing Education
## 24 RandomForest 0.8279163                    bucketizing Education
## 25       Linear 0.8195983                             New features
## 26       Glmnet 0.8515596                             New features
## 27         Tree 0.8364193                             New features
## 28 RandomForest 0.8245834                             New features
## 29       Linear 0.8212648                         More bucketising
## 30       Glmnet 0.8532830                         More bucketising
## 31         Tree 0.8364193                         More bucketising
## 32 RandomForest 0.8314200                         More bucketising
## 33       Linear 0.8448084 previous FE + removing features, no FSel
## 34       Glmnet 0.8566016 previous FE + removing features, no FSel
## 35         Tree 0.8364193 previous FE + removing features, no FSel
## 36 RandomForest 0.8364193 previous FE + removing features, no FSel
## 37       Linear 0.8380715                       with chisquared FS
## 38       Glmnet 0.8582823                       with chisquared FS
## 39         Tree 0.8364193                       with chisquared FS
## 40 RandomForest 0.8381427                       with chisquared FS
## 41       Linear 0.8380573                  with InformationGain FS
## 42       Glmnet 0.8549637                  with InformationGain FS
## 43         Tree 0.8364193                  with InformationGain FS
## 44 RandomForest 0.8329725                  with InformationGain FS

Results with the features selected by the chi-squared test are better, so we discard this selection.

6.3 Wrapper methods

# train_control_config_4_stepwise <- trainControl(method = "none", classProbs = TRUE, verboseIter = TRUE, allowParallel = TRUE)
# 
# #To make strings or factors valid names in R
# training_Wrapper <- sapply(training_raw, make.names)
#  
# registerDoMC(cores=4)
# backward.lm.mod <- train(Absenteeism ~ ., data = training_Wrapper, 
#                           method = "glmStepAIC", 
#                           direction = "backward",
#                           trace = FALSE,
#                           metric = "Accuracy",
#                           trControl=train_control_config_4_stepwise,
#                           num.threads = 4
#                           )
#  
#  
# paste("Features Selected" ,backward.lm.mod$finalModel$formula[3])

The backward stepwise model took a very long time to run, so we decided not to include it in our analysis (the code is kept commented out above).
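A lighter-weight alternative would be to run backward selection with `MASS::stepAIC` directly on a plain `glm`, outside of caret. A sketch on a built-in dataset (not our data):

```r
library(MASS)

# Backward stepwise selection by AIC on a small logistic model
fit  <- glm(am ~ mpg + wt + hp, data = mtcars, family = binomial)
back <- stepAIC(fit, direction = "backward", trace = FALSE)
formula(back)  # the retained formula
```

This fits the model once per candidate drop, whereas `glmStepAIC` wrapped through `train()` repeats the whole search on every resample, which is part of why it was so slow for us.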

7. Final Results and Conclusions

After testing many different feature engineering techniques, we come to the conclusion that our best predictor is the Ridge model, trained on the variables kept by the chi-squared test (threshold = 0) and, to our regret, with none of our feature engineering.

The results of all our trials can be seen below, showing the evolution of accuracy across models and treatments.

print("Our best bet:")
## [1] "Our best bet:"
comparisonDF[comparisonDF$Accuracy == max(comparisonDF$Accuracy), ]
##     Model  Accuracy          Treatment
## 38 Glmnet 0.8582823 with chisquared FS
write.csv(comparisonDF, file = 'comparisonDF.csv', row.names = FALSE)

Train the model and generate the file with the final predictions.

# Train the model using all the data
# This time we will create a bigger grid to fine tune our hyper parameters.
# Also we will do 2 repetitions to reduce overfitting as much as we can. 
finalTrain <- training_raw[!names(training_raw) %in% features_to_remove]
finalTest <- test[!names(test) %in% features_to_remove]


Glmnet.model.tunning <- function(training_dataset) {
  trainControlLasso <- trainControl(method = "repeatedcv",
                                    number = 5,
                                    repeats = 2,
                                    returnResamp = "all",
                                    verboseIter = F,
                                    sampling = "up")

  gridGlmnet<- expand.grid(alpha = seq(0, 1, by = 0.01), 
                         lambda = seq(0.001, 0.05, by = 0.01)) 

  set.seed(123)
  this.model <- train(Absenteeism ~ ., data = training_dataset,
                           method = "glmnet",
                           metric = "Accuracy",
                           preProc = c("center", "scale"),
                           trControl = trainControlLasso,
                           tuneGrid = gridGlmnet
                           )


  return(this.model)
}

bestTuneModel <- Glmnet.model.tunning(finalTrain)

print(bestTuneModel$bestTune)
##   alpha lambda
## 5     0  0.041
finalModel <- function(training_dataset) {
  trainControlLasso <- trainControl(method = "none")

  gridGlmnet<- expand.grid(alpha = 0, 
                         lambda = 0.041) 

  
  this.model <- train(Absenteeism ~ ., data = training_dataset,
                           method = "glmnet",
                           metric = "Accuracy",
                           preProc = c("center", "scale"),
                           trControl = trainControlLasso,
                           tuneGrid = gridGlmnet
                           )


  return(this.model)
}



final.model <- finalModel(finalTrain)



# Predict the class labels for the test data
final.pred <- predict(final.model, finalTest, type = "raw")

predictions <- data.frame(ID = testId$ID, Absenteeism.time.in.hours= (final.pred))
colnames(predictions) <-c("ID", "Absenteeism")
write.csv(predictions, file = "predictions.csv", row.names = FALSE) 

Considerations for the decision on the final model

  • Linear models are simple and we are not losing a significant level of accuracy by using them.
  • Boosting models tend to overfit during training.

Conclusions:

This model classifies prolonged absences with about 86% cross-validated accuracy. The most significant attribute is the reason of absence, as per the chi-squared analysis.

Moving forward

  • We will continue monitoring the model to see how it performs and whether new trends need to be considered.
  • We need to confirm whether 36 employees is a representative sample of the company's population. If they are a small fraction of the total, we will need to re-evaluate how personal attributes such as weight, height, age, and smoker and drinker status impact our model.
  • Going forward, we will also need to better understand some of the internal features delivered by the company: hit.target and work.load.avg.day.